Endangered Data for Endangered Languages: Digitizing Print Dictionaries
Abstract
This paper describes ongoing work in dictionary digitization, and in particular the processing of OCRed text into a structured lexicon. The description is at a conceptual level, without implementation details. In decades of work on endangered languages, hundreds of languages (or more) have been documented with print dictionaries. Into the 1980s, most such dictionaries were edited on paper media (such as 3x5 cards), then typeset by hand or on old computer systems (Bartholomew and Schoenhals 1983; Grimes 1970). SIL International, for example, has nearly 100 lexicons that date from their work during the period 1937–1983 (Verna Stutzman, p.c.). More recently, most dictionaries have been prepared on computers, using tools like SIL's Shoebox (later Toolbox) or Fieldworks Language Explorer (FLEx). These born-digital dictionaries were all at one time on electronic media: tapes, floppy diskettes, hard disks or CDs. In some cases those media are no longer readable, and no backups were made onto more durable media, so the only readable version we have of these dictionaries may be a paper copy (cf. Bird and Simons 2003; Borghoff et al. 2006). And while paper copies preserve their information (barring rot, fire, and termites), that information is inaccessible to computers. To make it accessible, the paper dictionary must be digitized. A great many other dictionaries of non-endangered languages are also available only in paper form. It might seem that digitization is simple. It is not. There are two approaches to digitization: keying in the text by hand, and Optical Character Recognition (OCR). While each has advantages and disadvantages, in the end we are faced with three problems:
Related articles
Creating Lexical Resources for Endangered Languages
This paper examines approaches to generate lexical resources for endangered languages. Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT). Since our work relies on only one bilingual dictionary between an endangered language and an “intermediate helper” language, it is applicable to languages that lack many existing r...
Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages
We introduce Waldayu and Waldayu Mobile, web and mobile front-ends for endangered language dictionaries. The Waldayu products are designed with the needs of novice users in mind – both novices in the language and technological novices – and work in tandem with existing lexicographic databases. We discuss some of the unique problems that endangered-language dictionary software products face, and ...
A Formosan Multimedia Dictionary Designed Via a Participatory Process
Digital archiving is important work for an endangered language, because if an endangered language disappears, associated cultural assets will disappear altogether. Several digital archiving projects are being conducted in Taiwan. Many tribal teachers are now involved in these projects. Based on the needs of these tribal teachers, this paper presents an easy-to-use system for digitally archiving ...
Creating multimedia dictionaries of endangered languages using LEXUS
This paper reports on the development of a flexible web-based lexicon tool, LEXUS. LEXUS is targeted at linguists involved in language documentation (of endangered languages). It allows the creation of lexica within the structure of the proposed ISO LMF standard and uses the proposed concept naming conventions from the ISO data categories, thus enabling interoperability, search and merging. LEX...
Automatically Creating a Large Number of New Bilingual Dictionaries
This paper proposes approaches to automatically create a large number of new bilingual dictionaries for low-resource languages, especially resource-poor and endangered languages, from a single input bilingual dictionary. Our algorithms produce translations of words in a source language to plentiful target languages using available Wordnets and a machine translator (MT). Since our approaches rely...